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Hsuan-Tien Lin (NTU CSIE) 


Linear Regression 
Roadmap 
@ When Can Machines Learn? 
@ Why Can Machines Learn? 










Lecture 8: Noise and Error 


learning can happen 
with target distribution P(y|x) and low Ein w.r.t. err 





Q How Can Machines Learn? 


Lecture 9: Linear Regression 


e Linear Regression Problem 

e Linear Regression Algorithm 

e Generalization Issue 

e Linear Regression for Binary Classification 





@ How Can Machines Learn Better? 
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Linear Regression 





unknown target function 
f:X—y 


(ideal credit limit formula) 


| 


training examples 
D: (X1, yi) ‚ (Хм, Ум) 


(historical records т bank) 














H 


hypothesis set 


Linear Regression Problem 


Credit Limit Problem 







































age 23 years 
gender female 
annual salary NTD 1,000,000 
year in residence 1 year 
year in job 0.5 year 
current debt 200,000 














learning 
algorithm 
A 





(set of candidate formula) 


credit limit? 100,000 














final hypothesis 
gef 











(‘learned’ formula to be used) 


y = R: regression | 
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Цпеаг Regression Linear Regression Problem 


Linear Regression Hypothesis 














age 23 years 
annual salary | NTD 1,000,000 

year in job 0.5 year 

current debt 200,000 

















e For X = (Xo, X1, х, ··· , Ха) ‘features of customer’, 
approximate the desired credit limit with a weighted sum: 


d 
Jo 58 WiXi 
i=0 


e linear regression hypothesis: h(x) = w! x 





h(x): like perceptron, but without the sign | 
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Ипеаг Regression Linear Regression Problem 


Illustration of Linear Regression 


T2 





linear regression: 
find lines/hyperplanes with small residuals 





Linear Regression Linear Regression Problem 


The Error Measure 


popular/historical error measure: 
squared error err(f, у) = (ў — у)? | 


out-of-sample 







Eou(W) = x E (МХ = Ија 





next: how to minimize Ej,(w)? J 
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Linear Regression Linear Regression Problem 


Fun Time 





Consider using linear regression hypothesis h(x) = w!x to 
predict the credit limit of customers x. Which feature below shall 
have a positive weight in a for the task? 


@ birth month 
@ monthly income 
@ current debt 
@ number of credit cards owned 


















Reference Answer: (2) 


Customers with higher monthly income should 
naturally be given a higher credit limit, which is 
captured by the positive weight on the 'monthly 
income' feature. 
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Linear Regression Linear Regression Algorithm 


Matrix Form of Ei (W) 







N N 
1 1 
En(w) = N wx, – уп) = М xw — уп)? 
п=1 п=1 
хлм– у 
ee x W — Yo 
М dos 
XIW — VN 
2 
--X4--] [у 
== z zx DE | 
MET 





Nxd--1 d--1x1 Nx1 


Цпеаг Regression Linear Regression Algorithm 


min Ein(W) = 1 хау — у? 


e Ei (w): continuous, differentiable, convex 
e necessary condition of 'best' w 

















Ein on (w) 0 
VEn(w) = | 9) |= ° 
м Fat (w) 0 


—not possible to ‘roll down’ 





task: find Wi iN such that УМЕ (Мм) =0 | 
HumTenus OU) “i ааа... 


Цпеаг Regression Linear Regression Algorithm 


The Gradient V En(w) 
En(w) = 1 хау – yl? = 1 W -2w' y vy) 
A b с 












Ein(w)= jj (wTAw — 2w b 4- c) 
УЕ. (м) = 1 (2Aw — 2b) 


simple! :-) similar (derived by definition) 





VEin(w) = 5 (X’Xw – ХТу) 





Цпеаг Regression Linear Regression Algorithm 


Optimal Linear Regression Weights 
task: find мл such that 2 (X"Xw — X'y) = VEjn(w) = 0 






e easy! unique solution e many optimal solutions 


24 e one of the solutions 
WI IN == (х7х) X y 
— Ír ,À Мм = X'y 
pseudo-inverse xt 
by defining X! in other ways 
e often the case because 
N>d+1 





practical suggestion: 
use well-implemented | routine 
instead of (ЕУ ХТ 
for numerical stability when almost-singular 
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Linear Regression Linear Regression Algorithm 


Linear Regression Algorithm 


@ from 2, construct input matrix X and output vector y by 










-e у 
|| 
БЕ ГЕ yw 

Nx (d--1) Nx1 


Ө calculate pseudo-inverse X! 
и 
(d+1)xN 
Ө return Win = Xly 
Мм 
(d+1)x1 


simple and efficient 
with good | routine | 


Цпеаг Regression Linear Regression Algorithm 


Fun Time 


After getting Ww, у, we can calculate the predictions у, = мү ux. If all ў, 
are collected in a vector У similar to how we form y, what is the matrix 
formula of $? 


Фу 

Ө хх'у 

© хх!у 

Ө ххїххТу 











Reference Answer: 9 


Note that Y = Хм, м. Then, a simple 
substitution of Wi iÑ reveals the answer. 
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Linear Regression Generalization Issue 


Is Linear Regression a ‘Learning Algorithm’? 


Win = xiy | 











e analytic (closed-form) 
solution, ‘instantaneous’ 

e not improving Ej, nor 
Eout iteratively 


e good En? 
yes, optimal! 
e good Es? 
yes, finite dc like perceptrons 
e improving iteratively? 
somewhat, within an iterative 
pseudo-inverse routine 





if Ecut(Wi iN) is good, learning ‘happened’! | 
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Ипеаг Regression Generalization Issue 


Benefit of Апају с Solution: 
‘Simpler-than-VC’ Guarantee 






ПЕ МА. D)} tobe sown noise level - (1 id — 
P 





1 ^ 1 
En(Wun) = yly- (У, 12 = У-ХХУЙ 
predictions ШИ 
1 
"I —xxlYI? 
= ql L = ХХ) 
identity 


call XX! the hat matrix H 
because it puts ^ on y 


Linear Regression Generalization Issue 


Geometric View of Hat Matrix 











span of X 


• 9 = Хмм Within the span of X columns 
° y — smallest: y — 9 | span 

° H: project y to У € span 

e 1— H: transform y to y — 9 | span 


claim: trace(I — H) = N — (d + 1). Why? :-) 
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Цпеаг Regression Generalization Issue 


An Illustrative ‘Proof’ 


span of X 





e if y comes from some ideal f(X) € span plus noise 
e noise transformed by I — H to be у — ӯ 


Та — H)noise|? 


= }(N-(d+1))|\noise||? 


1 А 
Ei (М.м) = У T У: 





= noise level. (1 — 91) 


noise level · (1 + 95") (complicated!) 


| m 
=j 
| 
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Linear Regression Generalization Issue 


The Learning Curve 


Eout = noise level · (1 + 9+! 
Ein = noise level . (1 — 257) 








Expected Error 





+1 Number of Data Points, N 
• both converge 10 о? (noise level) for N — со 


• expected generalization error: 29+) 


—similar to worst-case guarantee from VC 


linear regression (LinReg): 
learning ‘happened’! | 
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Linear Regression Generalization Issue 


Fun Time 


Which of the following property about H is not true? 
Ө His symmetric 
Ө H =H (double projection = single one) 

Ө (1— H)? =I— H (double residual transform = single опе) 
© none of the above 

















Reference Answer: (4) 


You can conclude that (2) and (3) are true by 
their physical meanings! :-) 
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Цпеаг Regression Linear Regression for Binary Classification 


Linear Classification vs. Linear Regression 


Linear Classification Linear Regression 








У = {-1,+1} ЈУ = Im 
h(x) = sign(w!'x) h(x) = wix 
еп(ў,у) = [7 #0] СО ec ЕЛЬ 


NP-hard їо solve іп general 





efficient analytic solution 


{—1,+1} C R: linear regression for classification? 


@ run LinReg on binary classification data D (efficient) 
Ө return g(x) = sign(w/,,.x) 


but explanation of this heuristic? | 
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Linear Regression Linear Regression for Binary Classification 


Relation of Two Errors 


спој = | si gn(w x) = y| errsqr = (wx = y) 





desired у = 1 desired у = —1 


— squared 
— 0/1 


err 

















Linear Regression Linear Regression for Binary Classification 


Linear Regression for Binary Classification 
erro/1 < errsqr | 


vC 
classification Еош(м) < classification Ei(w) + ./...... 
S 


e (loose) upper bound errsor as err to approximate erro A 
e trade bound tightness for efficiency 





Мк: Useful baseline classifier, 
or as initial PLA/pocket vector | 


Linear Regression Linear Regression for Binary Classification 


Fun Time 


Which of the following functions are upper bounds of the 
pointwise 0/1 error for y € {-1,+1}? 















© exp(-yw'x) 
Ө max(0, 1 — yw’x) 

© log? (1 + exp( -yw' x) 
© all of the above 





Reference Answer: (4) 


Plot the curves and you'll see. Thus, all three 
can be used for binary classification. In fact, all 
three functions connect to very important 
algorithms in machine learning and we will 
discuss one of them soon in the next lecture. 
Stay tuned. :-) 
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Linear Regression Linear Regression for Binary Classification 


Summary 
@ When Can Machines Learn? 


@ Why Can Machines Learn? 


Lecture 8: Noise and Error 


@ How Can Machines Learn? 


Lecture 9: Linear Regression 


ə Linear Regression Problem 
use hyperplanes to approximate real values 
ə Linear Regression Algorithm 
analytic solution with pseudo-inverse 
ə Generalization Issue 
Eut Еп шы: оп ауегаде 
e Linear Regression for Binary Classification 
0/1 error < squared error 





e next: binary classification, regression, and then? 
@ Ном Can Machines Learn Better? 
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